Authors


Hector R. Gavilanes Chief Information Officer
Gail Han Chief Operating Officer
Michael T. Mezzano Chief Technology Officer


University of West Florida

November 2023

Agenda

  • Introduction
  • Method
  • Example
  • Application
  • Discussion

Principal Component Analysis (PCA)

  • Unsupervised Machine Learning
  • Dimensionality Reduction Technique
  • Data Exploration
  • Feature Extraction
  • Data Visualization
  • Simplification of complex dataset
  • Principal Components (PCs): New variables that capture the variance in the original variables
  • Mitigate multicollinearity

Multivariate Approaches

Methods

  • Data matrix \(X\) of size \(N \times P\) (\(N\) observations, \(P\) variables).
  • Variables are linearly related.
  • Data are continuous and, ideally, normally distributed.
    • In practice, the initial distribution of the data is not critical.
  • Variables are similar in scale and without extreme outliers.
  • Missing data: Imputation or removal of observations.
  • Centering and scaling: Transform variables to a mean of 0 and a standard deviation of 1. \[ z_{np} = \frac{x_{np} - \bar{x}_{p}}{{\sigma_{p}}} \]
  • Covariance: A measure of how two random variables vary together. \[ Cov(x,y) = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{N} \]
  • Covariance Matrix: Symmetric \(P \times P\) matrix which gives the covariance for each pair of variables in the dataset.
  • Eigenvector: A nonzero vector whose direction is unchanged by a linear transformation.
  • The transformation scales an eigenvector by a factor \(\lambda\), the eigenvalue.
  • Each principal component is defined by an eigenvector of the covariance matrix.
    • The eigenvectors represent the directions of the new principal axes.
    • The eigenvalues represent the variance of the data along these axes.
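
The steps above can be sketched in base R. This is a minimal illustration, not the authors' code: the built-in USArrests data set is an assumed stand-in, and the object names (X, Z, S, eig) are chosen only for this example. Note that R's cov() divides by \(N - 1\) rather than \(N\); the constant factor rescales the eigenvalues but leaves the eigenvectors unchanged.

```r
# Stand-in data (assumption): the built-in USArrests data set.
X <- as.matrix(USArrests)

# Center and scale each column to mean 0 and standard deviation 1.
Z <- scale(X, center = TRUE, scale = TRUE)

# Covariance matrix of the standardized data.
# For standardized columns this equals the correlation matrix of X.
S <- cov(Z)

# Eigen decomposition of the symmetric covariance matrix:
# eigenvectors give the principal axes, eigenvalues the variance along them.
eig <- eigen(S, symmetric = TRUE)
eig$values   # eigenvalues, sorted in decreasing order
eig$vectors  # one eigenvector per column
```

Because the data are standardized, the eigenvalues sum to the number of variables.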

Finding the Principal Components

  • Find the linear combination of the columns of \(X\) (the variables) which maximizes variance.
  • Let \(a\) be a vector of constants \(a_1, a_2, a_3, \ldots, a_P\) such that \(Xa\) represents the linear combination which maximizes variance.
  • The variance of \(Xa\) is represented by \(var(Xa) = a^TSa\) with the covariance matrix \(S\).
  • Finding the \(Xa\) with maximum variance is equivalent to finding the vector \(a\) which maximizes the quadratic form \(a^TSa\), subject to \(a^Ta = 1\).
  • The solution \(a\) is a unit-norm eigenvector of the covariance matrix \(S\), with eigenvalue \(\lambda\).
  • For any unit-norm eigenvector \(a\) of \(S\) with eigenvalue \(\lambda\): \[ var(Xa) = a^TSa = \lambda a^Ta = \lambda \]
  • The variance is therefore maximized by the eigenvector \(a_1\) belonging to the largest eigenvalue \(\lambda_1\) of \(S\).
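
Why the maximizer must be an eigenvector can be sketched with a Lagrange multiplier. This is a standard argument written in the symbols used above, not a derivation taken from the slides. Introduce the multiplier \(\lambda\) for the constraint \(a^Ta = 1\):

\[ L(a, \lambda) = a^TSa - \lambda(a^Ta - 1) \]

Setting the gradient with respect to \(a\) to zero gives the eigenvalue equation:

\[ \frac{\partial L}{\partial a} = 2Sa - 2\lambda a = 0 \quad \Rightarrow \quad Sa = \lambda a \]

Left-multiplying by \(a^T\) and using \(a^Ta = 1\) recovers \(a^TSa = \lambda\), so the variance of \(Xa\) equals the eigenvalue.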

Principal Components

  • Require the coefficient vectors \(a_k\) to be orthogonal to one another.
    • This ensures the principal components are uncorrelated.
  • Each \(Xa_k\) is a principal component of the dataset having eigenvectors \(a_k\) and eigenvalues \(\lambda_k\).
  • Factor scores: The elements of \(Xa_k\).
    • How each observation scores on a PC.
    • Represent the projection of the original observations onto the PCs.
  • Loadings: The elements of the eigenvectors \(a_k\).
    • Represent the weights of the original variables in the computation of the PCs.
  • Eigenvectors: Represent directions of maximum variance.
  • Eigenvalues: Indicate the variance explained by each eigenvector.
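
The relationship between loadings, scores, and eigenvalues can be checked numerically. The following is a minimal base R sketch under stated assumptions: USArrests is a stand-in data set, and the names A, lambda, and scores are illustrative only.

```r
# Stand-in data (assumption): the built-in USArrests data set, standardized.
Z <- scale(as.matrix(USArrests))
S <- cov(Z)

eig    <- eigen(S, symmetric = TRUE)
A      <- eig$vectors   # loadings: column k is the eigenvector a_k
lambda <- eig$values    # eigenvalues: variance explained by each PC

# Factor scores: project the observations onto the principal axes.
scores <- Z %*% A       # column k holds the elements of Z a_k

# The loading vectors are orthonormal, so t(A) %*% A is the identity matrix.
round(crossprod(A), 10)

# The scores are uncorrelated, and their variances equal the eigenvalues,
# so the covariance matrix of the scores is diagonal.
round(cov(scores), 10)
```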

Example

  • Abalone dataset from the UCI Machine Learning Repository.
  • 4177 observations of 9 variables which record characteristics of each abalone including sex, length, diameter, height, weights, and the number of rings.
  • The variables, apart from sex, are continuous and correlated.

Preprocessing the data

  • Exclude non-numeric variables from the dataset.
    • The variable Sex is excluded.
  • Check for missing data.
    • No missing data in the dataset.
  • Scale and center the data.
  • Check for and handle extreme outliers.
    • Outliers do not present a large problem.

Perform Principal Component Analysis

The prcomp() function performs principal component analysis using a singular value decomposition of the centered (and optionally scaled) data matrix, rather than an eigen decomposition of the covariance matrix.

  • The standard deviation of each PC is the square root of its eigenvalue and measures the spread of the data along that principal component.
  • The proportion of variance is the percent of total variance captured by each PC.
  • The cumulative proportion gives the total variance captured by the PC and all prior PCs.
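
A minimal sketch of this step follows. The built-in USArrests data set is used as an assumed stand-in for the abalone data, which is not bundled with R; the names pca, prop_var, cum_var, and sdev_svd are illustrative only.

```r
# Stand-in data (assumption): USArrests instead of the abalone data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Standard deviation, proportion of variance, and cumulative proportion per PC.
summary(pca)

# The same quantities computed by hand from the PC standard deviations.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per PC
cum_var  <- cumsum(prop_var)              # cumulative proportion

# Equivalent route through the SVD of the standardized data matrix:
# singular values divided by sqrt(N - 1) give the PC standard deviations.
Z        <- scale(USArrests)
sv       <- svd(Z)
sdev_svd <- sv$d / sqrt(nrow(Z) - 1)
```

The columns of sv$v match pca$rotation up to sign, which is one way to confirm that the two routes describe the same components.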

Visualizing the results

Interpreting the results

  • The loadings of the first two principal components show the contribution of each variable to PC1 and PC2.

Dataset - State Averages of Common Dialysis Quality Measures

  • Source: Centers for Medicare & Medicaid Services (Data.CMS.gov)
  • Data collected from 50 U.S. states and 6 U.S. territories
  • 39 variables
    • 24 patient care quality in dialysis facilities
    • 14 characteristics of dialysis patients
    • Response variable: As Expected Survival
    • Index variable: a categorical variable, which was removed

Dataset Summary

Dataset Selection Rationale

  • Driven by multicollinearity.

  • Features less significant in explaining variability.

  • All variables are numeric

  • Categorical Index variable.

Data Preparation

  • Efficient removal of white spaces in the dataset.

  • Editing variable names to make them more readable and meaningful.

Original: “Percentage.Of.Adult..Patients.With.Hypercalcemia..Serum.Calcium.Greater.Than.10.2.Mg.dL.”

Edited: “hypercalcemia_calcium > 10.2Mg.”

Missing Values

  • 34 missing values.

  • Imputation of missing values using the mean (\(\mu\)).
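
Mean imputation can be sketched in a few lines of base R. The toy data frame and the helper impute_mean() below are hypothetical and only illustrate the idea; they are not the authors' code or data.

```r
# Hypothetical toy data with missing values (not the dialysis data).
df <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))

# Hypothetical helper: replace each NA with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper column by column.
df_imputed <- as.data.frame(lapply(df, impute_mean))
df_imputed
```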

Distribution

  • Normality is not assumed.

QQ-Plot of Residuals

  • Outliers are present throughout the dataset.

Standardization

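  • Each variable is rescaled using its mean \(\mu\) and standard deviation \(\sigma\), so that afterwards it has mean 0 and standard deviation 1.

    \[ Z = \frac{{ x - \mu }}{{ \sigma }} \]

    \[ \mu_Z = 0, \quad \sigma_Z = 1 \]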

Outliers & Leverage

  • 3 Outliers

  • No leverage

  • Minimal difference.

  • No observations removed.

Correlations

  • Multicollinearity is present.
  • Threshold = 0.30.
  • 28 Correlated features.
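
One possible reading of this count is the number of features whose absolute correlation with at least one other feature exceeds the threshold. The sketch below implements that reading in base R; it is an assumption about how the count could be obtained, and the built-in mtcars data set is a stand-in for the dialysis data.

```r
# Stand-in data (assumption): mtcars instead of the dialysis data.
X <- mtcars
threshold <- 0.30

# Pairwise correlation matrix; zero the diagonal to ignore self-correlation.
R <- cor(X)
diag(R) <- 0

# Features with |r| above the threshold for at least one other feature.
correlated <- colnames(R)[apply(abs(R) > threshold, 2, any)]
length(correlated)
```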

Scree Plot

  • PC1 explains 40.8% of the variance.
  • PC2 explains 9.5% of the variance.

BiPlot

  • PC1 (in black) shows the longest projection distance.
  • PC2 (in blue) shows a shorter distance, as expected.

Contribution of Variables

Modeling

# Packages used below: caTools provides sample.split(); caret provides preProcess()
library(caTools)
library(caret)

# Reproducible random sampling (my_seed is assumed to be defined earlier)
set.seed(my_seed)

# Target (response) variable used to split the data
y <- train_data$expected_survival
# Split the data into training (70%) and test (30%) sets
split <- sample.split(y, SplitRatio = 0.7)
training_set <- subset(train_data, split == TRUE)
test_set <- subset(train_data, split == FALSE)
# Center and scale the predictors using statistics estimated on the training set only
# (target_index, the column position of the response, is assumed to be defined earlier)
sc <- preProcess(training_set[, -target_index],
                 method = c("center", "scale"))
training_set[, -target_index] <- predict(
  sc, training_set[, -target_index])
test_set[, -target_index] <- predict(sc, test_set[, -target_index])
# Fit PCA on the training predictors, keeping 8 components
pca <- preProcess(training_set[, -target_index],
                  method = 'pca', pcaComp = 8)

# Apply the PCA transformation to the training set
training_set <- predict(pca, training_set)

# Reorder columns, moving the response column to the end
training_set <- training_set[c(2:9, 1)]

# Apply the same PCA transformation to the test set
test_set <- predict(pca, test_set)

# Reorder columns, moving the response column to the end
test_set <- test_set[c(2:9, 1)]

Uncorrelated Matrix

8 Principal Components

PC Regression

8 Components

2 Components

Cross-Validation Model

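# Requires the caret package for trainControl() and train()
# 10-fold cross-validation
k_10 <- trainControl(method = "cv", number = 10)

# Train a linear model on the principal components
# (train_pca is assumed to hold the PCA-transformed training data)
model_cv <- train(expected_survival ~ .,
                  data = train_pca,
                  method = "lm",
                  trControl = k_10)

# Print model performance
print(model_cv)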
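
For readers without caret, the idea of cross-validated principal component regression can be sketched in base R. This is not the authors' caret-based workflow: mtcars is an assumed stand-in data set, mpg an assumed stand-in response, and the seed, fold assignment, and number of components are arbitrary choices for illustration. Unlike the code above, this sketch refits the PCA inside each fold so that held-out rows do not influence the components.

```r
set.seed(1)                        # arbitrary seed for reproducibility

dat    <- mtcars                   # stand-in data (assumption)
y_name <- "mpg"                    # stand-in response (assumption)
X <- dat[, setdiff(names(dat), y_name)]
y <- dat[[y_name]]

n_comp <- 2                        # number of PCs used as regressors
k      <- 10                       # number of folds
folds  <- sample(rep(seq_len(k), length.out = nrow(dat)))

rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- folds != i

  # Fit PCA on the training fold only (centering and scaling included).
  pca <- prcomp(X[train, ], center = TRUE, scale. = TRUE)

  # PC scores for the training fold and projected scores for the held-out fold.
  train_pcs <- data.frame(y = y[train], pca$x[, 1:n_comp, drop = FALSE])
  test_pcs  <- data.frame(predict(pca, X[!train, ])[, 1:n_comp, drop = FALSE])

  # Regress the response on the retained PCs and score the held-out fold.
  fit  <- lm(y ~ ., data = train_pcs)
  pred <- predict(fit, newdata = test_pcs)
  rmse[i] <- sqrt(mean((y[!train] - pred)^2))
}

mean(rmse)                         # average held-out RMSE across folds
```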

Results

  • PCA was performed using SVD.
  • PC1 captures 40.80% of the variance in the data.
    • PC1 and PC2 capture 50.27% of the variance.
    • Over 90% of the information in the dataset can be explained by the first eleven PCs.
  • The variables which contribute the most to PC1 are
    • expected_hospital_readmission
    • expected_transfusion
    • expected_hospitalization
  • PC2 has large contributions from variables measuring phosphorus.
  • PCR was performed with expected_survival used as the response.
  • Estimates and significance of each PC regressor have key differences.
    • For example, PC3 is not a significant regressor while PC4 is.
  • Both models produced an \(R^2\) above 96% and a predicted \(R^2\) above 95% with a 1% advantage on the cross-validation model.

Discussion

  • Principal Component Analysis is an essential multivariate analysis technique
  • Covariance matrix -> SVD -> Principal Components
  • Scaling Data
  • Sensitive to outliers
  • Loss of interpretability in transformed features.
  • Loss of Information

PCA is definitely a useful tool to have in your toolkit!

Questions and Comments

References

[1] H. Abdi and L. J. Williams, “Principal component analysis,” WIREs Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010, doi: https://doi.org/10.1002/wics.101. Available: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wics.101

[2] S. R. Bennett, Linear algebra for data science. 2021. Available: https://shainarace.github.io/LinearAlgebra/index.html

[3] R. Bro and A. K. Smilde, “Principal component analysis,” Analytical methods, vol. 6, no. 9, pp. 2812–2831, 2014.

[4] F. Chumney, “PCA, EFA, CFA,” pp. 2–3, 6, Sep., 2012, Available: https://www.westga.edu/academics/research/vrc/assets/docs/PCA-EFA-CFA_EssayChumney_09282012.pdf

[5] D. Esposito and F. Esposito, Introducing machine learning. Microsoft Press, 2020.

[6] B. Everitt and T. Hothorn, An introduction to applied multivariate analysis with r. Springer Science & Business Media, 2011.

[7] R. A. Fisher and W. A. Mackenzie, “Studies in crop variation. II. The manurial response of different potato varieties,” The Journal of Agricultural Science, vol. 13, no. 3, pp. 311–320, 1923.

[8] F. L. Gewers et al., “Principal component analysis: A natural approach to data exploration,” ACM Computing Surveys (CSUR), vol. 54, no. 4, pp. 1–34, 2021.

[9] M. Greenacre, P. J. Groenen, T. Hastie, A. I. d’Enza, A. Markos, and E. Tuzhilina, “Principal component analysis,” Nature Reviews Methods Primers, vol. 2, no. 1, p. 100, 2022.

[10] B. M. S. Hasan and A. M. Abdulazeez, “A review of principal component analysis algorithm for dimensionality reduction,” Journal of Soft Computing and Data Mining, vol. 2, no. 1, pp. 20–30, 2021.

[11] J. Hopcroft and R. Kannan, Foundations of data science. 2014.

[12] H. Hotelling, “Analysis of a complex of statistical variables into principal components.” Journal of educational psychology, vol. 24, no. 6, p. 417, 1933.

[13] I. T. Jolliffe and J. Cadima, “Principal component analysis: A review and recent developments,” Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, p. 20150202, 2016.

[14] J. Lever, M. Krzywinski, and N. Altman, “Points of significance: Principal component analysis,” Nature methods, vol. 14, no. 7, pp. 641–643, 2017.

[15] D. G. Luenberger, Optimization by vector space methods. John Wiley & Sons, 1997.

[16] J. Maindonald and J. Braun, Data analysis and graphics using r: An example-based approach, vol. 10. Cambridge University Press, 2006.

[17] J. Pagès, Multiple factor analysis by example using r. CRC Press, 2014.

[18] K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin philosophical magazine and journal of science, vol. 2, no. 11, pp. 559–572, 1901.

[19] F. Pedregosa et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011.

[20] M. Ringnér, “What is principal component analysis?” Nature biotechnology, vol. 26, no. 3, pp. 303–304, 2008.

[21] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[22] S. Zhang and M. Turk, “Eigenfaces,” Scholarpedia, vol. 3, no. 9, p. 4244, 2008.

[23] E. K. CS, “PCA problem / how to compute principal components / KTU machine learning.” YouTube, 2020. Available: https://youtu.be/MLaJbA82nzk

[24] S. Nash Warwick and W. Ford, “Abalone.” UCI Machine Learning Repository, 1995. Available: https://doi.org/10.24432/C55C7W

[25] “Quarterly dialysis facility care compare (QDFCC) report: July 2023.” Centers for Medicare & Medicaid Services (CMS). Available: https://data.cms.gov/provider-data/dataset/2fpu-cgbb. [Accessed: Oct. 11, 2023]

[26] R Core Team, “R: A language and environment for statistical computing.” R Foundation for Statistical Computing, Vienna, Austria, 2023. Available: https://www.R-project.org/

[27] A. Kassambara and F. Mundt, Factoextra: Extract and visualize the results of multivariate data analyses. 2020. Available: https://CRAN.R-project.org/package=factoextra

[28] H. Wickham et al., “Welcome to the tidyverse,” Journal of Open Source Software, vol. 4, no. 43, p. 1686, 2019, doi: 10.21105/joss.01686

[29] H. Zhu, kableExtra: Construct complex table with ’kable’ and pipe syntax. 2021. Available: https://CRAN.R-project.org/package=kableExtra

[30] D. Comtois, Summarytools: Tools to quickly and neatly summarize data. 2022. Available: https://CRAN.R-project.org/package=summarytools

[31] T. Wei and V. Simko, R package ’corrplot’: Visualization of a correlation matrix. 2021. Available: https://github.com/taiyun/corrplot

[32] B. Cui, DataExplorer: Automate data exploration and treatment. 2020. Available: https://CRAN.R-project.org/package=DataExplorer

[33] A. Hebbali, Olsrr: Tools for building OLS regression models. 2020. Available: https://CRAN.R-project.org/package=olsrr

[34] M. Kuhn, “Building predictive models in r using the caret package,” Journal of Statistical Software, vol. 28, no. 5, pp. 1–26, 2008, doi: 10.18637/jss.v028.i05. Available: https://www.jstatsoft.org/index.php/jss/article/view/v028i05

[35] D. Lüdecke, sjPlot: Data visualization for statistics in social science. 2023. Available: https://CRAN.R-project.org/package=sjPlot

[36] J. Tuszynski, caTools: Tools: Moving window statistics, GIF, Base64, ROC AUC, etc. 2021. Available: https://CRAN.R-project.org/package=caTools

[37] A. Kassambara, Ggcorrplot: Visualization of a correlation matrix using ’ggplot2’. 2023. Available: https://CRAN.R-project.org/package=ggcorrplot

[38] K. H. Liland, B.-H. Mevik, and R. Wehrens, Pls: Partial least squares and principal component regression. 2023. Available: https://CRAN.R-project.org/package=pls

[39] B. Hamner and M. Frasco, Metrics: Evaluation metrics for machine learning. 2018. Available: https://CRAN.R-project.org/package=Metrics